Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Junwei Bao

GVPO: Group Variance Policy Optimization for Large Language Model Post-Training

Apr 28, 2025

Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, Hui Xiong

Abstract:Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their practical adoption. To address this challenge, we present Group Variance Policy Optimization (GVPO). GVPO incorporates the analytical solution to KL-constrained reward maximization directly into its gradient weights, ensuring alignment with the optimal policy. The method provides intuitive physical interpretations: its gradient mirrors the mean squared error between the central distance of implicit rewards and that of actual rewards. GVPO offers two key advantages: (1) it guarantees a unique optimal solution, exactly the KL-constrained reward maximization objective, (2) it supports flexible sampling distributions that avoids on-policy and importance sampling limitations. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.

Via

Access Paper or Ask Questions

Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

Dec 18, 2024

Yuzhong Hong, Hanshan Zhang, Junwei Bao, Hongfei Jiang, Yang Song

Abstract:Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently,the minimizer of the RLHF loss might be unattainable because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To approximate the MLE in practice, we propose a contrastive loss named Energy Preference Alignment (EPA), wherein each positive sample is contrasted against one or more strong negatives as well as many free weak negatives. Theoretical properties of our EBM enable the approximation error of EPA to almost surely vanish when a sufficient number of negatives are used. Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby showing the superiority of our EBM.

Via

Access Paper or Ask Questions

Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

Dec 17, 2024

Yuchen Fan, Yuzhong Hong, Qiushi Wang, Junwei Bao, Hongfei Jiang, Yang Song

Figure 1 for Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

Figure 2 for Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

Figure 3 for Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

Figure 4 for Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

Abstract:Alignment, endowing a pre-trained Large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling typically with a cross-entropy objective, requiring a large amount of high-quality instruction-response pairs. However, the quality of widely used SFT datasets can not be guaranteed due to the high cost and intensive labor for the creation and maintenance in practice. To overcome the limitations associated with the quality of SFT datasets, we introduce a novel \textbf{p}reference-\textbf{o}riented supervised \textbf{f}ine-\textbf{t}uning approach, namely PoFT. The intuition is to boost SFT by imposing a particular preference: \textit{favoring the target model over aligned LLMs on the same SFT data.} This preference encourages the target model to predict a higher likelihood than that predicted by the aligned LLMs, incorporating assessment information on data quality (i.e., predicted likelihood by the aligned LLMs) into the training process. Extensive experiments are conducted, and the results validate the effectiveness of the proposed method. PoFT achieves stable and consistent improvements over the SFT baselines across different training datasets and base models. Moreover, we prove that PoFT can be integrated with existing SFT data filtering methods to achieve better performance, and further improved by following preference optimization procedures, such as DPO.

* AAAI2025, 12 pages, 9 figures

Via

Access Paper or Ask Questions

BoRA: Bi-dimensional Weight-Decomposed Low-Rank Adaptation

Dec 09, 2024

Qiushi Wang, Yuchen Fan, Junwei Bao, Hongfei Jiang, Yang Song

Figure 1 for BoRA: Bi-dimensional Weight-Decomposed Low-Rank Adaptation

Figure 2 for BoRA: Bi-dimensional Weight-Decomposed Low-Rank Adaptation

Figure 3 for BoRA: Bi-dimensional Weight-Decomposed Low-Rank Adaptation

Figure 4 for BoRA: Bi-dimensional Weight-Decomposed Low-Rank Adaptation

Abstract:In recent years, Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) have significantly enhanced the adaptability of large-scale pre-trained models. Weight-Decomposed Low-Rank Adaptation (DoRA) improves upon LoRA by separating the magnitude and direction components of the weight matrix, leading to superior performance. However, DoRA's improvements are limited to the vertical dimension, resulting in an asymmetrical pattern between horizontal and vertical dimensions. This paper introduces BoRA, an innovative extension of LoRA and DoRA, characterized by symmetrical properties across horizontal and vertical dimensions. Our approach optimizes the weight matrix symmetrically by adjusting both column-wise and row-wise magnitudes. Extensive experiments demonstrate that BoRA surpasses state-of-the-art PEFT methods, including LoRA and DoRA, achieving superior results across various benchmarks.

Via

Access Paper or Ask Questions

Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Aug 09, 2024

Guanming Xiong, Junwei Bao, Hongfei Jiang, Yang Song, Wen Zhao

Figure 1 for Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Figure 2 for Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Figure 3 for Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Figure 4 for Interactive-T2S: Multi-Turn Interactions for Text-to-SQL with Large Language Models

Abstract:This study explores text-to-SQL parsing by leveraging the powerful reasoning capabilities of large language models (LLMs). Despite recent advancements, existing LLM-based methods have not adequately addressed scalability, leading to inefficiencies when processing wide tables. Furthermore, current interaction-based approaches either lack a step-by-step, interpretable SQL generation process or fail to provide an efficient and universally applicable interaction design. To address these challenges, we introduce Interactive-T2S, a framework that generates SQL queries through direct interactions with databases. This framework includes four general tools that facilitate proactive and efficient information retrieval by the LLM. Additionally, we have developed detailed exemplars to demonstrate the step-wise reasoning processes within our framework. Our experiments on the BIRD-Dev dataset, employing a setting without oracle knowledge, reveal that our method achieves state-of-the-art results with only two exemplars, underscoring the effectiveness and robustness of our framework.

* 15 pages, 7 figures

Via

Access Paper or Ask Questions

Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models

Feb 23, 2024

Guanming Xiong, Junwei Bao, Wen Zhao

Figure 1 for Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models

Figure 2 for Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models

Figure 3 for Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models

Figure 4 for Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models

Abstract:This study explores the realm of knowledge-base question answering (KBQA). KBQA is considered a challenging task, particularly in parsing intricate questions into executable logical forms. Traditional semantic parsing (SP)-based methods require extensive data annotations, which result in significant costs. Recently, the advent of few-shot in-context learning, powered by large language models (LLMs), has showcased promising capabilities. Yet, fully leveraging LLMs to parse questions into logical forms in low-resource scenarios poses a substantial challenge. To tackle these hurdles, we introduce Interactive-KBQA, a framework designed to generate logical forms through direct interaction with knowledge bases (KBs). Within this framework, we have developed three generic APIs for KB interaction. For each category of complex question, we devised exemplars to guide LLMs through the reasoning processes. Our method achieves competitive results on the WebQuestionsSP, ComplexWebQuestions, KQA Pro, and MetaQA datasets with a minimal number of examples (shots). Importantly, our approach supports manual intervention, allowing for the iterative refinement of LLM outputs. By annotating a dataset with step-wise reasoning processes, we showcase our model's adaptability and highlight its potential for contributing significant enhancements to the field.

* Codes will be released upon acceptance

Via

Access Paper or Ask Questions

HopPG: Self-Iterative Program Generation for Multi-Hop Question Answering over Heterogeneous Knowledge

Sep 10, 2023

Yingyao Wang, Yongwei Zhou, Chaoqun Duan, Junwei Bao, Tiejun Zhao

Figure 1 for HopPG: Self-Iterative Program Generation for Multi-Hop Question Answering over Heterogeneous Knowledge

Figure 2 for HopPG: Self-Iterative Program Generation for Multi-Hop Question Answering over Heterogeneous Knowledge

Figure 3 for HopPG: Self-Iterative Program Generation for Multi-Hop Question Answering over Heterogeneous Knowledge

Figure 4 for HopPG: Self-Iterative Program Generation for Multi-Hop Question Answering over Heterogeneous Knowledge

Abstract:The semantic parsing-based method is an important research branch for knowledge-based question answering. It usually generates executable programs lean upon the question and then conduct them to reason answers over a knowledge base. Benefit from this inherent mechanism, it has advantages in the performance and the interpretability. However, traditional semantic parsing methods usually generate a complete program before executing it, which struggles with multi-hop question answering over heterogeneous knowledge. On one hand, generating a complete multi-hop program relies on multiple heterogeneous supporting facts, and it is difficult for generators to understand these facts simultaneously. On the other hand, this way ignores the semantic information of the intermediate answers at each hop, which is beneficial for subsequent generation. To alleviate these challenges, we propose a self-iterative framework for multi-hop program generation (HopPG) over heterogeneous knowledge, which leverages the previous execution results to retrieve supporting facts and generate subsequent programs hop by hop. We evaluate our model on MMQA-T^2, and the experimental results show that HopPG outperforms existing semantic-parsing-based baselines, especially on the multi-hop questions.

Via

Access Paper or Ask Questions

AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Jun 16, 2023

Yu Lu, Junwei Bao, Zichen Ma, Xiaoguang Han, Youzheng Wu, Shuguang Cui, Xiaodong He

Figure 1 for AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Figure 2 for AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Figure 3 for AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Figure 4 for AUGUST: an Automatic Generation Understudy for Synthesizing Conversational Recommendation Datasets

Abstract:High-quality data is essential for conversational recommendation systems and serves as the cornerstone of the network architecture development and training strategy design. Existing works contribute heavy human efforts to manually labeling or designing and extending recommender dialogue templates. However, they suffer from (i) the limited number of human annotators results in that datasets can hardly capture rich and large-scale cases in the real world, (ii) the limited experience and knowledge of annotators account for the uninformative corpus and inappropriate recommendations. In this paper, we propose a novel automatic dataset synthesis approach that can generate both large-scale and high-quality recommendation dialogues through a data2text generation process, where unstructured recommendation conversations are generated from structured graphs based on user-item information from the real world. In doing so, we comprehensively exploit: (i) rich personalized user profiles from traditional recommendation datasets, (ii) rich external knowledge from knowledge graphs, and (iii) the conversation ability contained in human-to-human conversational recommendation datasets. Extensive experiments validate the benefit brought by the automatically synthesized data under low-resource scenarios and demonstrate the promising potential to facilitate the development of a more effective conversational recommendation system.

* 10 pages

Via

Access Paper or Ask Questions

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Nov 27, 2022

Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, Tianrui Li

Figure 1 for SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Figure 2 for SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Figure 3 for SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Figure 4 for SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Abstract:Recently, the contrastive language-image pre-training, e.g., CLIP, has demonstrated promising results on various downstream tasks. The pre-trained model can capture enriched visual concepts for images by learning from a large scale of text-image data. However, transferring the learned visual knowledge to open-vocabulary semantic segmentation is still under-explored. In this paper, we propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation in an annotation-free manner. The SegCLIP achieves segmentation based on ViT and the main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs. The gathering operation can dynamically capture the semantic groups, which can be used to generate the final segmentation results. We further propose a reconstruction loss on masked patches and a superpixel-based KL loss with pseudo-labels to enhance the visual representation. Experimental results show that our model achieves comparable or superior segmentation accuracy on the PASCAL VOC 2012 (+1.4% mIoU), PASCAL Context (+2.4% mIoU), and COCO (+5.6% mIoU) compared with baselines. We release the code at https://github.com/ArrowLuo/SegCLIP.

Via

Access Paper or Ask Questions

MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Nov 11, 2022

Haoning Zhang, Junwei Bao, Haipeng Sun, Youzheng Wu, Wenye Li, Shuguang Cui, Xiaodong He

Figure 1 for MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Figure 2 for MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Figure 3 for MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Figure 4 for MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking

Abstract:Dialogue state tracking (DST) aims to convert the dialogue history into dialogue states which consist of slot-value pairs. As condensed structural information memorizing all history information, the dialogue state in the last turn is typically adopted as the input for predicting the current state by DST models. However, these models tend to keep the predicted slot values unchanged, which is defined as state momentum in this paper. Specifically, the models struggle to update slot values that need to be changed and correct wrongly predicted slot values in the last turn. To this end, we propose MoNET to tackle state momentum via noise-enhanced training. First, the previous state of each turn in the training data is noised via replacing some of its slot values. Then, the noised previous state is used as the input to learn to predict the current state, improving the model's ability to update and correct slot values. Furthermore, a contrastive context matching framework is designed to narrow the representation distance between a state and its corresponding noised variant, which reduces the impact of noised state and makes the model better understand the dialogue history. Experimental results on MultiWOZ datasets show that MoNET outperforms previous DST methods. Ablations and analysis verify the effectiveness of MoNET in alleviating state momentum and improving anti-noise ability.

* 8 pages, 6 figures, 3 tables

Via

Access Paper or Ask Questions